Introduction

Train delays can significantly impact the efficiency of urban transit systems by affecting passengers’ schedules especially around large metropolitans like New York City where a large portion of people relies on trains for commute. However, if we can be informed of the delays ahead of time, the chances that they will affect our travels will be much lower. Therefore, having a better understanding of and being able to predict these potential delays is crucial for the health of an urban transit system. In this study, we will be using a dataset that records train delay data of NJ Transit and Amtrak rails from 2018 to 2020 to conduct our exploratory analysis and build predictive models. A proposal for a mobile application will also be presented to make the technical model accessible for everyday passengers who need it the most.

Exploratory Analysis

But before jumping right into the model building part, we first must conduct a thorough exploratory analysis on the data we plan to work with. In this portion of the report, we will be analyzing the weather factor (precipitation and temperature), fixed effects (time of the day, day of the week, delay distribution), and the temporal as well as the spatial lags to see how the delays respond to each of these factors.

Weather

Since most of the NJ transit stations are in New Jersey state, weather data from Newark Airport (EWR) is used. The following time series plots displays the precipitation and temperature for each hour in EWT during March 2018. The plots reveals a few days of precipitation.

grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() + 
  labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
  #ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    #labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
  top="Weather Data - EWR - March, 2018")

Describe and Explore the Data

The Time Series Plot below illustrate Bike share tripe per hours in Jersey City. The skyblue line is used to visualize the Friday. In can see the relatively high delay occur in mid-March.

merged_dataset%>%
  dplyr::select(interval60, from, delay_minutes) %>%
  gather(Variable, Value, -interval60, -from) %>%
    group_by(Variable, interval60) %>%
    summarize(Value = mean(Value))%>%
    ggplot(aes(interval60, Value)) + 
    #geom_vline(data = monday, aes(xintercept = monday), color = "red") +
    geom_vline(data = Friday, aes(xintercept = Friday), color = "skyblue") +
    geom_line(size = 0.8,colour="darkblue")+
      labs(title = "Delay distribution in A Month", subtitle = "NJ, March, 2018",  x = "Day", y= "Mean Delay") +
     theme_minimal()

  plotTheme

Fixed effects

grid.arrange(ggplot(data = delay_day, aes(x = dotw, y = mean_delay)) +
  geom_bar(stat = "identity",fill = "darkblue") +
  labs(title = "Delay minutes in a week", x = "Day of The Week", y = "Mean Delay") +
  theme_minimal(),
  
  ggplot(data = delay_hour, aes(x = hour, y = mean_delay)) +
  geom_bar(stat = "identity",fill = "darkblue") +
  labs(title = "Delay minutes in 24 hours", x = "Hour in a day", y = "Mean Delay") +
  theme_minimal(),
  
  ggplot(data = delay_sequence, aes(x = stop_sequence, y = mean_delay)) +
  geom_bar(stat = "identity",fill = "darkblue") +
  labs(title = "Delay minutes in each sequence", x = "Stop Sequence", y = "Mean Delay") +
  theme_minimal(),
  
  ggplot(data = delay_time, aes(x = time_of_day, y = mean_delay)) +
  geom_bar(stat = "identity",fill = "darkblue") +
  labs(title = "Delay minutes in a week", x = "Day of The Week", y = "Mean Delay") +
  theme_minimal(),
  
  nrow=2)

The following time series plots display delay minutes in NJ Transit, broken down by day of the week and by weekdays versus weekends. The data shows a peak in delay minnutes on Sunday, with overall higher delay on weekends compared to weekdays. This trend indicates that delay situation is more commonly happen in weekend.

delay_week_hour <- merged_dataset %>%
  group_by(weekend,hour(interval60))%>%
  summarize(mean_delay = mean(delay_minutes))%>%
  rename(hour = 'hour(interval60)')

ggplot(data = delay_week_hour, aes(x = hour, y = mean_delay, color = weekend)) +
  geom_line() +
  scale_color_manual(values = palette2) +
  labs(title = "The delay under week and time", x = "Hour", y = "Delay Minutes") +
  theme_minimal()

palette <- c(
  "#1b9e77", # Teal green
  "#d95f02", # Vibrant orange
  "#7570b3", # Muted purple
  "#e7298a", # Bright pink
  "#66a61e", # Leaf green
  "#e6ab02", # Golden yellow
  "#a6761d"  # Warm brown
)

delay_week_day <- merged_dataset %>%
  group_by(dotw,hour(interval60))%>%
      filter(!is.na(dotw)) %>%
  summarize(mean_delay = mean(delay_minutes))%>%
  rename(hour = 'hour(interval60)')

ggplot(data = delay_week_day, aes(x = hour, y = mean_delay, color = dotw)) +
  geom_line() +
  scale_color_manual(values = palette) +
  labs(title = "The delay under week and time", x = "Hour", y = "Delay Minutes") +
  theme_minimal()

The maps below illustrate the hourly delays by station. Weekends tend to experience relatively higher delays compared to weekdays. Overall, delayed stations appear to be spatially distributed in a random pattern, without a clear clustering or specific geographic concentration.

ggplot() +
  # Plot NJTracts as a base map
  geom_sf(data = NJTracts %>% 
            st_transform(crs = 4326), color = "white") +
  
  # Add points for bike share trips
  geom_point(data = merged_dataset %>% 
               mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                      time_of_day = case_when(
                        hour(interval60) < 7 | hour(interval60) > 19 ~ "Overnight",
                        hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                        hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                        hour(interval60) >= 15 & hour(interval60) <= 19 ~ "PM Rush"
                      )) %>%
               filter(!is.na(time_of_day)) %>%
               group_by(from_id, from_lat, from_lon, weekend, time_of_day) %>%
               summarize(mean_delay = mean(delay_minutes, na.rm = TRUE), .groups = "drop") %>%
               ungroup(),
             aes(x = from_lon, y = from_lat, color = mean_delay), 
        
             alpha = 0.9, size = 0.9) +
  
  # Add color scale for mean delay
  scale_colour_viridis(direction = -1, discrete = FALSE, option = "E") +
  
  # Set axis limits
  ylim(min(merged_dataset$from_lat), max(merged_dataset$from_lat)) +
  xlim(min(merged_dataset$from_lon), max(merged_dataset$from_lon)) +
  
  # Add facets for weekend and time of day
  facet_grid(weekend ~ time_of_day) +
  
  # Add labels and themes
  labs(title = "delay min by station. NJ, march, 2018") +
  theme_minimal() +
   # Remove longitude and latitude from axes
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank()) +
  theme_void()

The plot below displays an animated map of station delays, revealing a relatively high frequency of delays in the northern area near Manhattan, NY. This suggests that stations closer to Manhattan experience more frequent delays, potentially due to higher traffic, congestion, or operational complexity in this region.

delay_stop_time <- merged_dataset %>%
  group_by(from,to,hour(interval60))%>%
  summarize(mean_delay = mean(delay_minutes))%>%
  rename(hour = 'hour(interval60)')%>%
  left_join(stop,by=c('from'='STATION_ID'))%>%
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326)

ggplot() +
    geom_sf(data = NJTracts, color = 'white') + 
    geom_sf(data = delay_stop_time, aes(size = mean_delay,color = mean_delay), 
            fill = "transparent", alpha = 0.8) +
  
    scale_colour_viridis(
          name = "Mean delay", # Combined title for color and size
    direction = -1,
    option = "E",
    discrete = FALSE
  ) +
  scale_size_continuous(
    name = "Mean delay", # Same title as color to combine legend
    range = c(1, 5) # Adjust size range as needed
  ) +
  guides(
    color = guide_legend() # Use a single legend guide
  ) +

    #scale_colour_viridis(direction = -1,discrete = FALSE, option = "D") +
    labs(title = "Station Delay For One Day in March, 2018",
         subtitle = "Hours in a day: {current_frame}") +
    transition_manual(hour)+ mapTheme+theme_minimal()+  theme_void()

Temporal & Spatial lag

Creating Spatial and time lag variables capture the dependence of phenomena on geospatial and time series, thereby improving the predictive accuracy and explanatory power of the model. In practical applications, these variables can help us better understand the dynamic nature of delay and external influences.

The time lag variables considers how a phenomenon at one point in time is affected by a previous point in time.

The following scatterplots present that Temporal lag has correlation with the delay.

Short time lags (such as lag15min and lag30min) have a greater impact on current delays and should be prioritized as features in the model. However, Long-term lags (e.g., lag2h and lag3h) are less relevant but still capture long-term delay trends.

# Prepare the data
scatter_data <- as.data.frame(ride.panel_min) %>%
  group_by(interval15) %>% 
  summarise_at(vars(starts_with("lag"), "delay_minutes"), mean, na.rm = TRUE) %>%
  gather(Variable, Value, -interval15, -delay_minutes) %>%
  mutate(Variable = factor(Variable, levels = c("lag15min", "lag30min", "lag45min", "lag1h",
                                                "lag1h15min", "lag1h30min", "lag1h45min", "lag2h", "lag2h15min",
                                                "lag2h30min", "lag2h45min", "lag3h") )) %>% 
    filter(!is.na(Variable))


# Compute correlation values
correlations <- scatter_data %>%
  group_by(Variable) %>%
  summarize(correlation = round(cor(Value, delay_minutes, use = "complete.obs"), 2)) %>% 
  filter(!is.na(Variable))

# Create scatterplot
ggplot(scatter_data, aes(x = Value, y = delay_minutes)) +
  geom_point(color = "grey20", alpha = 0.6, size = 1) + # Uniform color for points
  geom_smooth(method = "lm", color = "red", se = FALSE, size = 0.5, linetype = "dashed") + # Best-fit line
  facet_wrap(~Variable, scales = "free", nrow=2) + # Separate scatterplots for each Variable
  labs(
    title = "Scatterplot of Lag Variables vs Delay Minutes",
    x = "Lag Value",
    y = "Delay Minutes"
  ) +
  theme_minimal() +
  # Add correlation values as annotations
  geom_text(
    data = correlations,
    aes(label = paste("r =", correlation), x = Inf, y = Inf),
    hjust = 1.1, vjust = 1.2, inherit.aes = FALSE
  )

Spatial lag variables consider how phenomena in one location are affected by neighboring locations, especially in geographic space. The following scatterplots indicate that spatial lag are very strong correlation with the delay. However, the correlation are decreasing with each additional lag stations, but the predictive power are stronger in overall.

# Prepare the data
scatter_data_station <- as.data.frame(ride.panel_station) %>%
  group_by(train_id) %>% 
  summarise_at(vars(starts_with("lags"), "delay_minutes"), mean, na.rm = TRUE) %>%
  gather(Variable, Value, -train_id, -delay_minutes) %>%
  mutate(Variable = factor(Variable, levels = c(
    "lagsstation", "lags2station", "lags3station", "lags4station",
    "lags5station", "lags6station", "lags7station", "lags8station"
  )))

# Compute correlation values
correlations_station <- scatter_data_station %>%
  group_by(Variable) %>%
  summarize(correlation = round(cor(Value, delay_minutes, use = "complete.obs"), 2))

# Create scatterplot
ggplot(scatter_data_station, aes(x = Value, y = delay_minutes)) +
  geom_point(color = "grey20", alpha = 0.6, size = 2) + # Uniform color for points
  geom_smooth(method = "lm", color = "red", se = FALSE, size = 1, linetype = "dashed") + # Best-fit line
  facet_wrap(~Variable, scales = "free", nrow=1) + # Separate scatterplots for each Variable
  labs(
    title = "Scatterplot of Station Lags vs Delay Minutes",
    x = "Lag Value",
    y = "Delay Minutes"
  ) +
  xlim(0,25)+
ylim(0,25)+
  theme_minimal() +
  # Add correlation values as annotations
  geom_text(
    data = correlations_station,
    aes(label = paste("r =", correlation), x = Inf, y = Inf),
    hjust = 1.1, vjust = 1.2, inherit.aes = FALSE
  )

Model

This section split data into a training and test set and develop a 3 week training set and a 2 week test set of all the stations. In the following analysis, Three different linear regression are estimated on delay.Train.

Since delays exhibit spatial and temporal correlations, incorporating time and space features into the models can lead to improved predictions. Additionally, variables such as the origin and destination stations (from and to), whether it is a weekday or weekend, temperature, precipitation, and stop sequence are also included to further enhance the model’s accuracy.

  1. reg.30 (Predict delays within 30 minutes)
  • Shorter lag variables are used, such as lag45min, lag1h, lag1h15min, etc. These lag variables correspond to the possible effects in the target time range.

  • At the same time, lags3station, lags4station, lags5station, and lags6station contain lagging site variables, indicating the propagation of delay in space.

  • It is suitable for analyzing the prediction of delays in a shorter time range (within 30 minutes).

  1. reg.60 (Forecast delay within 60 minutes)
  • Some lag variables of shorter time (such as lag45min) are omitted, and more attention is paid to lag variables of more than 1 hour, such as lag1h30min, lag2h, lag2h30min, etc.

  • Simplified site lag variables, including only lags5station and lags6station.

  • It is suitable for analyzing predictions of medium time range (within 60 minutes) delays.

  1. reg.90 (Predict delays within 90 minutes)
  • Focus on longer time range lag variables, such as lag1h30min, lag2h, lag3h, and remove short time lag variables (such as lag45min).

  • Only one site lag variable, lags6station, is retained.

  • Suitable for analyzing predictions of long time ranges (within 90 minutes).

delay.Train <- filter(ride.panel_station, week <= 11)
delay.Test <- filter(ride.panel_station, week > 11)


reg.30 <- 
  lm(delay_minutes ~ from + to + weekend + Temperature + Precipitation + lag45min 
     + lag1h + lag1h15min + lag1h30min + lag1h45min + lag2h + lag2h15min + lag2h30min + 
       lag2h45min + lag3h 
     + lags3station+ lags4station+ lags5station + stop_sequence  , 
     data=delay.Train)

reg.60 <- 
  lm(delay_minutes ~  from + to + weekend + Temperature + Precipitation 
    + lag1h + lag1h15min + lag1h30min + lag1h45min + lag2h + lag2h15min + lag2h30min + lag2h45min + lag3h + lags5station + lags6station + stop_sequence , 
     data=delay.Train)

reg.90 <- 
  lm(delay_minutes ~  from + to  + weekend + Temperature + Precipitation 
    + lag1h30min + lag1h45min + lag2h + lag2h15min + lag2h30min + lag2h45min + lag3h+ lags6station + stop_sequence , 
     data=delay.Train)
delay.Test.weekNest <- 
   delay.Test %>%
  nest(-week) 

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

The Bar charts below illustrates the MAE values for the Three models. It displays the highest MAE value in the 90 min delay model, and the lowest MAE value in the 30-min delay model. This indicates that the 30-minute model is the most accurate among the three.

week_predictions <- 
  delay.Test.weekNest %>% 
        mutate(mod.30 = map(.x = data, fit = reg.30, .f = model_pred),
           mod.60 = map(.x = data, fit = reg.60, .f = model_pred),
           mod.90 = map(.x = data, fit = reg.90, .f = model_pred),
           ) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, delay_minutes),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme

The time series plot below shows the predicted and observed mean delay. Overall, the models tend to under-predict the observed values. Among all regression models, the 30-minute model is the most accurate, closely capturing the observed delay patterns. The longer time prediction (such as 60 and 90 min model) typically introduce greater uncertainty, so the accuracy decrease while the prediction horizon increases (60 and 90 minutes).

delay.Test_5 <- delay.Test %>%
  mutate(pre_5 = predict(reg.30, newdata = delay.Test),
         abosulte_error = abs(pre_5 - delay_minutes),
         MAE = mean(abosulte_error, na.rm = TRUE),
         sd_AE = sd(abosulte_error, na.rm = TRUE),
         per_error = (pre_5 - delay_minutes) / delay_minutes,
         per_error = ifelse(per_error == Inf, 0, per_error),
         per_error = ifelse(per_error == -Inf, 0, per_error)) %>%
  rename(mod_30 = pre_5)

delay.Test_6 <- delay.Test%>%
  mutate(pre_6 = predict(reg.60, newdata = delay.Test),
         abosulte_error = abs(pre_6 - delay_minutes),
         MAE = mean(abosulte_error,na.rm = TRUE),
         sd_AE = sd(abosulte_error,na.rm = TRUE),
         per_error = (pre_6 - delay_minutes)/delay_minutes,
         per_error = ifelse(per_error == Inf,0,per_error),
         per_error = ifelse(per_error == -Inf,0,per_error))%>%
  rename(mod_60 = pre_6)

delay.Test_7 <- delay.Test%>%
  mutate(pre_7 = predict(reg.90, newdata = delay.Test),
         abosulte_error = abs(pre_7 - delay_minutes),
         MAE = mean(abosulte_error,na.rm = TRUE),
         sd_AE = sd(abosulte_error,na.rm = TRUE),
         per_error = (pre_7 - delay_minutes)/delay_minutes,
         per_error = ifelse(per_error == Inf,0,per_error),
         per_error = ifelse(per_error == -Inf,0,per_error))%>%
  rename(mod_90 = pre_7)

grid.arrange(
delay.Test_5 %>%
  dplyr::select(interval60, from, delay_minutes, mod_30) %>%
  gather(Variable, Value, -interval60, -from) %>%
  group_by(Variable, interval60) %>%
  summarize(Value = mean(Value, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(interval60, Value, colour = Variable)) + 
    geom_line(size = 0.9) +
    labs(
      title = "Predicted/Observed Delay Time Series", 
      subtitle = "30 Minutes Pre-Predict",  
      x = "Day", 
      y = "Mean Delay"
    ) +
    theme_minimal(),

delay.Test_6 %>%
  dplyr::select(interval60, from, delay_minutes, mod_60) %>%
  gather(Variable, Value, -interval60, -from) %>%
  group_by(Variable, interval60) %>%
  summarize(Value = mean(Value, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(interval60, Value, colour = Variable)) + 
    geom_line(size = 0.9) +
    labs(
      title = "Predicted/Observed Delay Time Series", 
      subtitle = "60 Minutes Pre-Predict",  
      x = "Day", 
      y = "Mean Delay"
    ) +
    theme_minimal(),

delay.Test_7 %>%
  dplyr::select(interval60, from, delay_minutes, mod_90) %>%
  gather(Variable, Value, -interval60, -from) %>%
  group_by(Variable, interval60) %>%
  summarize(Value = mean(Value, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(interval60, Value, colour = Variable)) + 
    geom_line(size = 0.9) +
    labs(
      title = "Predicted/Observed Delay Time Series", 
      subtitle = "90 Minutes Pre-Predict",  
      x = "Day", 
      y = "Mean Delay"
    ) +
    theme_minimal(),
 ncol=1)

The maps below present the MAE (Mean Absolute Error) values for three different models. As observed, the southern stations exhibit relatively higher MAE values compared to other regions, indicating greater prediction errors in these areas. Moreover, the maps below that still illustrates the 30-min model is the most accurate predictions, with smaller and more localized errors.

# Combine the data for all models into one data frame
all_models_data <- week_predictions %>%
  mutate(interval15 = map(data, pull, interval15),
         from_id = map(data, pull, from_id), 
         from_lat = map(data, pull, from_lat), 
         from_lon = map(data, pull, from_lon)) %>%
  select(interval15, from_id, from_lon, from_lat, Observed, Prediction, Regression) %>%
  unnest() %>%
  filter(Regression %in% c("mod.30", "mod.60", "mod.90")) %>% # Include only the models of interest
  group_by(from_id, from_lon, from_lat, Regression) %>%
  summarize(MAE = mean(abs(Observed - Prediction), na.rm = TRUE), .groups = "drop") %>%
  mutate(Model = case_when(
    Regression == "mod.30" ~ "30 min",
    Regression == "mod.60" ~ "60 min",
    Regression == "mod.90" ~ "90 min"
  ))

# Plot with facet_wrap
ggplot(all_models_data) +
  geom_sf(data = NJ_Census, color = "grey50", fill = "grey70") +
  geom_point(aes(x = from_lon, y = from_lat, color = MAE), 
             fill = "transparent", alpha = 0.8) +
  scale_color_gradientn(name = "Mean Absolute Error", colors = c("lightyellow", "red")) +
  facet_wrap(~Model) +
  ylim(min(merged_dataset$from_lat), max(merged_dataset$from_lat)) +
  xlim(min(merged_dataset$from_lon), max(merged_dataset$from_lon)) +
  labs(title = "Mean Absolute Error by Model") +
  mapTheme

The maps below display the distribution of MAE values for NJ transit stations during different time periods on weekdays and weekends using the 30-minute model. Overall, weekends exhibit relatively higher MAE values compared to weekdays. Meanwhile, the model performs better during weekdays, as indicated by lower MAE values across most stations and time periods.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           from_id = map(data, pull, from_id), 
           from_lat = map(data, pull, from_lat), 
           from_lon = map(data, pull, from_lon),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, from_id, from_lon, 
           from_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "mod.30")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(from_id, weekend, time_of_day, from_lon, from_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = NJ_Census, color = "white")+
   geom_point(aes(x = from_lon, y = from_lat, color = MAE), 
             fill = "transparent", alpha = 0.5, size = 3) +
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "E") +
  ylim(min(dat_census$from_lat), max(dat_census$from_lat))+
  xlim(min(dat_census$from_lon), max(dat_census$from_lon))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, 30 min Model")+
  mapTheme

Conclusion

All three of our predictive models can do a decent job at predicting potential delays for NJ transit trains with a MAE of less than 10 minutes. With the help of the mobile app, people who commute to work or school, those who frequently travel using public transit, and those with a tight schedule can benefit from the relative accurate predicted delay time provided by our models. Not only could this help them save time and stay away from unexpected delays, but it can also help them with their traveling behavior while offering them with alternative plans. To make the analysis better, we can always include more relevant information and factors that might affect train delay time for a more accurate and comprehensive understanding of the problem we are dealing with. We can dig deeper into the study and analyze the delay times for each of the stations to find delay patterns that we might overlooked with our macro approach. Another thing we can do is to include data from the adjacent cities’ transit system to understand how they affect the delay times in New Jersey.

---
title: "Forecast Metro train delays in and around NYC"
date: "2024-12-12"
author: Jiatong Su, RunChuang Ma
output:
  html_document:
    theme: simplex
    toc: yes
    toc_float: yes
    progress: hide
    code_folding: hide
    code_download: yes
  params:
  include_warnings: false  # Add this line to suppress warnings
---

# Introduction

Train delays can significantly impact the efficiency of urban transit systems by affecting passengers’ schedules especially around large metropolitans like New York City where a large portion of people relies on trains for commute. However, if we can be informed of the delays ahead of time, the chances that they will affect our travels will be much lower. Therefore, having a better understanding of and being able to predict these potential delays is crucial for the health of an urban transit system. In this study, we will be using a dataset that records train delay data of NJ Transit and Amtrak rails from 2018 to 2020 to conduct our exploratory analysis and build predictive models. A proposal for a mobile application will also be presented to make the technical model accessible for everyday passengers who need it the most.

-   Please check the [video presentation](https://youtu.be/3bgx7cLofmI) here.

```{r setup, warning=FALSE, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(ggplot2)
library(kableExtra)
library(dplyr)
library(geosphere)
library(gganimate)


plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.2),
  axis.ticks=element_blank())

mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm")
                  )

palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")
palette7 <- c("red", "blue", "green", "orange", "purple", "yellow", "pink")


tidycensus::census_api_key("e79f3706b6d61249968c6ce88794f6f556e5bf3d", overwrite = TRUE)

njtransit <- read_csv("/Users/jiatong/Desktop/2018_03.csv") 

```

```{r, include=FALSE}
stop <- st_read("https://services6.arcgis.com/M0t0HPE53pFK525U/arcgis/rest/services/NJTransit_Rail_Stations/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson") %>%
  mutate(STATION_ID = ifelse(STATION_ID == 'Atlantic City', 'Atlantic City Rail Terminal', STATION_ID),
         STATION_ID = ifelse((STATION_ID == 'Middletown')&(COUNTY == 'Orange, NY'), 'Middletown NY', STATION_ID),
         STATION_ID = ifelse(STATION_ID == 'Princeton Jct.', 'Princeton Junction', STATION_ID),
         STATION_ID = ifelse(STATION_ID == 'Secaucus Junction Upper Level', 'Secaucus Upper Lvl', STATION_ID),
         STATION_ID = ifelse(STATION_ID == 'Anderson Street-Hackensack', 'Anderson Street', STATION_ID),
         STATION_ID = ifelse(STATION_ID == 'Bay Street-Montclair', 'Bay Street', STATION_ID),
         STATION_ID = ifelse(STATION_ID == 'Broadway', 'Broadway Fair Lawn', STATION_ID),
         STATION_ID = ifelse(STATION_ID == 'Essex Street-Hackensack', 'Essex Street', STATION_ID),
         STATION_ID = ifelse(STATION_ID == 'Glen Rock-Boro Hall', 'Glen Rock Boro Hall', STATION_ID),
         STATION_ID = ifelse(STATION_ID == 'Glen Rock-Main', 'Glen Rock Main Line', STATION_ID),
         STATION_ID = ifelse(STATION_ID == 'Hoboken Terminal', 'Hoboken', STATION_ID),
         STATION_ID = ifelse((STATION_ID == 'Middletown')&(COUNTY == 'Monmouth'), 'Middletown NJ', STATION_ID),
         STATION_ID = ifelse(STATION_ID == 'Montclair St Univ', 'Montclair State U', STATION_ID),
         STATION_ID = ifelse(STATION_ID == 'Mountain View-Wayne', 'Mountain View', STATION_ID),
         STATION_ID = ifelse(STATION_ID == 'Pennsauken Transit Center', 'Pennsauken', STATION_ID),
         STATION_ID = ifelse(STATION_ID == 'Radburn', 'Radburn Fair Lawn', STATION_ID),
         STATION_ID = ifelse(STATION_ID == 'Ramsey', 'Ramsey Main St', STATION_ID),
         STATION_ID = ifelse(STATION_ID == 'Rte 17 Ramsey', 'Ramsey Route 17', STATION_ID),
         STATION_ID = ifelse(STATION_ID == 'Secaucus Junction Lower Level', 'Secaucus Lower Lvl', STATION_ID),
         STATION_ID = ifelse(STATION_ID == 'Teterboro-Williams Ave', 'Teterboro', STATION_ID),
         STATION_ID = ifelse(STATION_ID == 'Watsessing', 'Watsessing Avenue', STATION_ID),
         STATION_ID = ifelse(STATION_ID == 'Wayne Route 23 Transit Center', 'Wayne-Route 23', STATION_ID),
         STATION_ID = ifelse(STATION_ID == 'Wood-Ridge', 'Wood Ridge', STATION_ID),
         STATION_ID = ifelse(STATION_ID == '30th Street Station', 'Philadelphia', STATION_ID),
         line_intersct = str_count(RAIL_SERVICE, ",") + 1)%>%
  dplyr::select(STATION_ID,LATITUDE,LONGITUDE,line_intersct)%>%
  st_drop_geometry()

```

```{r, include=FALSE}
merged_dataset <- merge(njtransit, stop, by.x = "from", by.y = "STATION_ID",all.x = TRUE) 
# merge from 
merged_dataset <- merge(merged_dataset, stop, by.x = "to", by.y = "STATION_ID",all.x = TRUE) 
# merge to

merged_dataset <- merged_dataset%>%
  filter(to != 'Mount Arlington')%>%
  filter(from != 'Mount Arlington')%>%
  rename(from_lat = LATITUDE.x,
         from_lon = LONGITUDE.x,
         from_inter = line_intersct.x,
         to_lat = LATITUDE.y,
         to_lon = LONGITUDE.y,
         to_inter = line_intersct.y
         )%>%
  mutate(distance = distHaversine(cbind(from_lon, from_lat), cbind(to_lon, to_lat)),
         interval60 = floor_date(ymd_hms(scheduled_time), unit = "hour"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 19 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 19 ~ "PM Rush"),
         weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")) 


```

# Exploratory Analysis

But before jumping right into the model building part, we first must conduct a thorough exploratory analysis on the data we plan to work with. In this portion of the report, we will be analyzing the weather factor (precipitation and temperature), fixed effects (time of the day, day of the week, delay distribution), and the temporal as well as the spatial lags to see how the delays respond to each of these factors.

## Weather

Since most of the NJ transit stations are in New Jersey state, weather data from Newark Airport (EWR) is used. The following time series plots displays the precipitation and temperature for each hour in EWT during March 2018. The plots reveals a few days of precipitation.

```{r, include=FALSE}
weather.Panel <- 
  riem_measures(station = "EWR", date_start = "2018-03-01", date_end = "2018-03-31") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

glimpse(weather.Panel)

```

```{r}
grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() + 
  labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
  #ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    #labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
  top="Weather Data - EWR - March, 2018")
```

## Describe and Explore the Data

```{r, include=FALSE}
# census data
NJ_Census <- 
  get_acs(geography = "county subdivision", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2018, 
          state = "NJ", 
          geometry = TRUE, 
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
NJTracts <- 
  NJ_Census %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf
```

```{r, include=FALSE}
dat_census <- st_join(merged_dataset %>% 
          filter(is.na(from_lon) == FALSE &
                   is.na(from_lat) == FALSE &
                   is.na(to_lat) == FALSE &
                   is.na(to_lon) == FALSE) %>%
          st_as_sf(., coords = c("from_lon", "from_lat"), crs = 4326),
        NJTracts %>%
          st_transform(crs=4326),
        join=st_intersects,
              left = TRUE) %>%
  rename(From.Tract = GEOID) %>%
  mutate(from_lon = unlist(map(geometry, 1)),
         from_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry) %>%
  st_as_sf(., coords = c("to_lon", "to_lat"), crs = 4326) %>%
  st_join(., NJTracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(To.Tract = GEOID)  %>%
  mutate(to_lon = unlist(map(geometry, 1)),
         to_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)

Friday <- 
  mutate(dat_census,
         Friday = ifelse(dotw == "Fri" & hour(interval60) == 1,
                         interval60, 0)) %>%
  filter(Friday != 0)

monday <- 
  mutate(dat_census,
         monday = ifelse(dotw == "Mon" & hour(interval60) == 1,
                         interval60, 0)) %>%
  filter(monday != 0)

```

The Time Series Plot below illustrate Bike share tripe per hours in Jersey City. The skyblue line is used to visualize the Friday. In can see the relatively high delay occur in mid-March.

```{r, message=FALSE, warning=FALSE, results='hide'	}
merged_dataset%>%
  dplyr::select(interval60, from, delay_minutes) %>%
  gather(Variable, Value, -interval60, -from) %>%
    group_by(Variable, interval60) %>%
    summarize(Value = mean(Value))%>%
    ggplot(aes(interval60, Value)) + 
    #geom_vline(data = monday, aes(xintercept = monday), color = "red") +
    geom_vline(data = Friday, aes(xintercept = Friday), color = "skyblue") +
    geom_line(size = 0.8,colour="darkblue")+
      labs(title = "Delay distribution in A Month", subtitle = "NJ, March, 2018",  x = "Day", y= "Mean Delay") +
     theme_minimal()
  plotTheme
```

## Fixed effects

```{r, warning=FALSE, include=FALSE}
delay_time <- merged_dataset %>%
  group_by(time_of_day)%>%
    filter(!is.na(time_of_day)) %>%
  summarize(mean_delay = mean(delay_minutes))

delay_day <- merged_dataset %>%
  group_by(dotw)%>%
    filter(!is.na(dotw)) %>%
  summarize(mean_delay = mean(delay_minutes))

delay_week <- merged_dataset %>%
  group_by(weekend)%>%
  summarize(mean_delay = mean(delay_minutes))

delay_hour <- merged_dataset %>%
  group_by(hour(interval60))%>%
  summarize(mean_delay = mean(delay_minutes))%>%
  rename(hour = 'hour(interval60)')

delay_sequence <- merged_dataset %>%
  group_by(stop_sequence)%>%
  summarize(mean_delay = mean(delay_minutes))

delay_time_week <- merged_dataset %>%
  group_by(time_of_day,weekend)%>%
  summarize(mean_delay = mean(delay_minutes))

delay_week_hour <- merged_dataset %>%
  group_by(weekend,hour(interval60))%>%
  summarize(mean_delay = mean(delay_minutes))%>%
  rename(hour = 'hour(interval60)')

delay_week_day <- merged_dataset %>%
  group_by(dotw,hour(interval60))%>%
  summarize(mean_delay = mean(delay_minutes))%>%
  rename(hour = 'hour(interval60)')
```

```{r, warning=FALSE}
grid.arrange(ggplot(data = delay_day, aes(x = dotw, y = mean_delay)) +
  geom_bar(stat = "identity",fill = "darkblue") +
  labs(title = "Delay minutes in a week", x = "Day of The Week", y = "Mean Delay") +
  theme_minimal(),
  
  ggplot(data = delay_hour, aes(x = hour, y = mean_delay)) +
  geom_bar(stat = "identity",fill = "darkblue") +
  labs(title = "Delay minutes in 24 hours", x = "Hour in a day", y = "Mean Delay") +
  theme_minimal(),
  
  ggplot(data = delay_sequence, aes(x = stop_sequence, y = mean_delay)) +
  geom_bar(stat = "identity",fill = "darkblue") +
  labs(title = "Delay minutes in each sequence", x = "Stop Sequence", y = "Mean Delay") +
  theme_minimal(),
  
  ggplot(data = delay_time, aes(x = time_of_day, y = mean_delay)) +
  geom_bar(stat = "identity",fill = "darkblue") +
  labs(title = "Delay minutes in a week", x = "Day of The Week", y = "Mean Delay") +
  theme_minimal(),
  
  nrow=2)

```

The following time series plots display delay minutes in NJ Transit, broken down by day of the week and by weekdays versus weekends. The data shows a peak in delay minnutes on Sunday, with overall higher delay on weekends compared to weekdays. This trend indicates that delay situation is more commonly happen in weekend.

```{r,warning=FALSE, message=FALSE, message=FALSE}

delay_week_hour <- merged_dataset %>%
  group_by(weekend,hour(interval60))%>%
  summarize(mean_delay = mean(delay_minutes))%>%
  rename(hour = 'hour(interval60)')

ggplot(data = delay_week_hour, aes(x = hour, y = mean_delay, color = weekend)) +
  geom_line() +
  scale_color_manual(values = palette2) +
  labs(title = "The delay under week and time", x = "Hour", y = "Delay Minutes") +
  theme_minimal()
  
```

```{r, warning=FALSE, message=FALSE}
palette <- c(
  "#1b9e77", # Teal green
  "#d95f02", # Vibrant orange
  "#7570b3", # Muted purple
  "#e7298a", # Bright pink
  "#66a61e", # Leaf green
  "#e6ab02", # Golden yellow
  "#a6761d"  # Warm brown
)

delay_week_day <- merged_dataset %>%
  group_by(dotw,hour(interval60))%>%
      filter(!is.na(dotw)) %>%
  summarize(mean_delay = mean(delay_minutes))%>%
  rename(hour = 'hour(interval60)')

ggplot(data = delay_week_day, aes(x = hour, y = mean_delay, color = dotw)) +
  geom_line() +
  scale_color_manual(values = palette) +
  labs(title = "The delay under week and time", x = "Hour", y = "Delay Minutes") +
  theme_minimal()

```

```{r, include=FALSE}
merged_dataset %>%                
  filter(!is.na(time_of_day)) %>%
  group_by(from, time_of_day)%>%
  summarize(mean_delay = mean(delay_minutes))%>%
  ggplot()+
  geom_histogram(aes(mean_delay), binwidth = 1)+
  labs(title="Mean delay min. NJ Transit, March, 2018",
       x="Mean delay min", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme
```

The maps below illustrate the hourly delays by station. Weekends tend to experience relatively higher delays compared to weekdays. Overall, delayed stations appear to be spatially distributed in a random pattern, without a clear clustering or specific geographic concentration.

```{r}

ggplot() +
  # Plot NJTracts as a base map
  geom_sf(data = NJTracts %>% 
            st_transform(crs = 4326), color = "white") +
  
  # Add points for bike share trips
  geom_point(data = merged_dataset %>% 
               mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                      time_of_day = case_when(
                        hour(interval60) < 7 | hour(interval60) > 19 ~ "Overnight",
                        hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                        hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                        hour(interval60) >= 15 & hour(interval60) <= 19 ~ "PM Rush"
                      )) %>%
               filter(!is.na(time_of_day)) %>%
               group_by(from_id, from_lat, from_lon, weekend, time_of_day) %>%
               summarize(mean_delay = mean(delay_minutes, na.rm = TRUE), .groups = "drop") %>%
               ungroup(),
             aes(x = from_lon, y = from_lat, color = mean_delay), 
        
             alpha = 0.9, size = 0.9) +
  
  # Add color scale for mean delay
  scale_colour_viridis(direction = -1, discrete = FALSE, option = "E") +
  
  # Set axis limits
  ylim(min(merged_dataset$from_lat), max(merged_dataset$from_lat)) +
  xlim(min(merged_dataset$from_lon), max(merged_dataset$from_lon)) +
  
  # Add facets for weekend and time of day
  facet_grid(weekend ~ time_of_day) +
  
  # Add labels and themes
  labs(title = "delay min by station. NJ, march, 2018") +
  theme_minimal() +
   # Remove longitude and latitude from axes
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank()) +
  theme_void()

```

The plot below displays an animated map of station delays, revealing a relatively high frequency of delays in the northern area near Manhattan, NY. This suggests that stations closer to Manhattan experience more frequent delays, potentially due to higher traffic, congestion, or operational complexity in this region.

```{r, message=FALSE, warning=FALSE}
delay_stop_time <- merged_dataset %>%
  group_by(from,to,hour(interval60))%>%
  summarize(mean_delay = mean(delay_minutes))%>%
  rename(hour = 'hour(interval60)')%>%
  left_join(stop,by=c('from'='STATION_ID'))%>%
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326)

ggplot() +
    geom_sf(data = NJTracts, color = 'white') + 
    geom_sf(data = delay_stop_time, aes(size = mean_delay,color = mean_delay), 
            fill = "transparent", alpha = 0.8) +
  
    scale_colour_viridis(
          name = "Mean delay", # Combined title for color and size
    direction = -1,
    option = "E",
    discrete = FALSE
  ) +
  scale_size_continuous(
    name = "Mean delay", # Same title as color to combine legend
    range = c(1, 5) # Adjust size range as needed
  ) +
  guides(
    color = guide_legend() # Use a single legend guide
  ) +

    #scale_colour_viridis(direction = -1,discrete = FALSE, option = "D") +
    labs(title = "Station Delay For One Day in March, 2018",
         subtitle = "Hours in a day: {current_frame}") +
    transition_manual(hour)+ mapTheme+theme_minimal()+  theme_void()

```

```{r, message=FALSE, include=FALSE}
study.panel <- 
  expand.grid(interval60=unique(dat_census$interval60), 
              from_id = unique(dat_census$from_id)) %>%
  left_join(., dat_census %>%
              select(from_id, from, From.Tract, from_lat, from_lon )%>%
              distinct() %>%
              group_by(from_id) %>%
              slice(1))

nrow(study.panel)
```

```{r, message=FALSE, include=FALSE}
ride.panel_1 <- 
  dat_census %>%
  right_join(study.panel) %>% 
  group_by(scheduled_time, interval60, from_id, from, From.Tract, from_lat, from_lon, delay_minutes,stop_sequence,train_id, to, weekend, dotw) %>%
  summarize(mean_delay = mean(delay_minutes, na.rm = TRUE), .groups = "drop") %>%
  left_join(weather.Panel) %>%
  ungroup() %>%
  filter(is.na(from_id) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(From.Tract) == FALSE) %>% 
  mutate(interval15 = floor_date(ymd_hms(scheduled_time), unit = "15 mins"))

```

## Temporal & Spatial lag

Creating Spatial and time lag variables capture the dependence of phenomena on geospatial and time series, thereby improving the predictive accuracy and explanatory power of the model. In practical applications, these variables can help us better understand the dynamic nature of delay and external influences.

```{r, include=FALSE}
ride.panel <- 
  left_join(ride.panel_1, NJ_Census %>%
              as.data.frame() %>%
              select(-geometry), by = c("From.Tract" = "GEOID"))


ride.panel_min <- 
  ride.panel %>% 
  arrange(from_id, interval15) %>% 
mutate(lag15min = dplyr::lag(delay_minutes,1),
         lag30min = dplyr::lag(delay_minutes,2),
         lag45min = dplyr::lag(delay_minutes,3),
         lag1h = dplyr::lag(delay_minutes,4),
         lag1h15min = dplyr::lag(delay_minutes,5),
         lag1h30min = dplyr::lag(delay_minutes,6),
         lag1h45min = dplyr::lag(delay_minutes,7),
         lag2h = dplyr::lag(delay_minutes,8),
         lag2h15min = dplyr::lag(delay_minutes,9),
         lag2h30min = dplyr::lag(delay_minutes,10),
         lag2h45min = dplyr::lag(delay_minutes,11),
         lag3h = dplyr::lag(delay_minutes,12),
         #lag1day = dplyr::lag(delay_minutes,96)
       ) %>% 
   mutate(min_15 = yday(interval15))  

as.data.frame(ride.panel_min) %>%
    group_by(interval15) %>% 
    summarise_at(vars(starts_with("lag"), "delay_minutes"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval15, -delay_minutes) %>%
    mutate(Variable = factor(Variable, levels=c("lag15min","lag30min","lag45min","lag1h",
                                                "lag1h15min","lag1h30min","lag1h45min","lag2h","lag2h15min",
                                                "lag2h30min","lag2h45min","lag3h")))%>%
    group_by(Variable) %>%  
    summarize(correlation = round(cor(Value, delay_minutes, use="complete.obs"),2))


```

```{r, include=FALSE}
ride.panel_station <- 
  ride.panel_min %>% 
  arrange(train_id, interval15, stop_sequence) %>% 
  mutate(
    lagsstation = if_else(stop_sequence == 1, 0, lag(delay_minutes, 1)),
    lags2station = if_else(stop_sequence <= 2, 0, lag(delay_minutes, 2)),
    lags3station = if_else(stop_sequence <= 3, 0, lag(delay_minutes, 3)),
    lags4station = if_else(stop_sequence <= 4, 0, lag(delay_minutes, 4)),
    lags5station = if_else(stop_sequence <= 5, 0, lag(delay_minutes, 5)),
    lags6station = if_else(stop_sequence <= 6, 0, lag(delay_minutes, 6)),
    lags7station = if_else(stop_sequence <= 7, 0, lag(delay_minutes, 7)),
    lags8station = if_else(stop_sequence <= 8, 0, lag(delay_minutes, 8))
  )

as.data.frame(ride.panel_station) %>%
  group_by(train_id) %>% 
  summarise_at(vars(starts_with("lags"), "delay_minutes"), mean, na.rm = TRUE) %>%
  gather(Variable, Value, -train_id, -delay_minutes) %>% # Remove `interval15` if unnecessary
  mutate(Variable = factor(Variable, levels = c(
    "lagsstation", "lags2station", "lags3station", "lags4station",
    "lags5station", "lags6station", "lags7station", "lags8station"
  ))) %>%
  group_by(Variable) %>%  
  summarize(correlation = round(cor(Value, delay_minutes, use = "complete.obs"), 2))

```

The time lag variables considers how a phenomenon at one point in time is affected by a previous point in time.

The following scatterplots present that Temporal lag has correlation with the delay.

Short time lags (such as lag15min and lag30min) have a greater impact on current delays and should be prioritized as features in the model. However, Long-term lags (e.g., lag2h and lag3h) are less relevant but still capture long-term delay trends.

```{r fig.width=15, fig.height=5, warning=FALSE, message=FALSE}

# Prepare the data
scatter_data <- as.data.frame(ride.panel_min) %>%
  group_by(interval15) %>% 
  summarise_at(vars(starts_with("lag"), "delay_minutes"), mean, na.rm = TRUE) %>%
  gather(Variable, Value, -interval15, -delay_minutes) %>%
  mutate(Variable = factor(Variable, levels = c("lag15min", "lag30min", "lag45min", "lag1h",
                                                "lag1h15min", "lag1h30min", "lag1h45min", "lag2h", "lag2h15min",
                                                "lag2h30min", "lag2h45min", "lag3h") )) %>% 
    filter(!is.na(Variable))


# Compute correlation values
correlations <- scatter_data %>%
  group_by(Variable) %>%
  summarize(correlation = round(cor(Value, delay_minutes, use = "complete.obs"), 2)) %>% 
  filter(!is.na(Variable))

# Create scatterplot
ggplot(scatter_data, aes(x = Value, y = delay_minutes)) +
  geom_point(color = "grey20", alpha = 0.6, size = 1) + # Uniform color for points
  geom_smooth(method = "lm", color = "red", se = FALSE, size = 0.5, linetype = "dashed") + # Best-fit line
  facet_wrap(~Variable, scales = "free", nrow=2) + # Separate scatterplots for each Variable
  labs(
    title = "Scatterplot of Lag Variables vs Delay Minutes",
    x = "Lag Value",
    y = "Delay Minutes"
  ) +
  theme_minimal() +
  # Add correlation values as annotations
  geom_text(
    data = correlations,
    aes(label = paste("r =", correlation), x = Inf, y = Inf),
    hjust = 1.1, vjust = 1.2, inherit.aes = FALSE
  )


```

Spatial lag variables consider how phenomena in one location are affected by neighboring locations, especially in geographic space. The following scatterplots indicate that spatial lag are very strong correlation with the delay. However, the correlation are decreasing with each additional lag stations, but the predictive power are stronger in overall.

```{r fig.width=15, fig.height=5, warning=FALSE, message=FALSE}

# Prepare the data
scatter_data_station <- as.data.frame(ride.panel_station) %>%
  group_by(train_id) %>% 
  summarise_at(vars(starts_with("lags"), "delay_minutes"), mean, na.rm = TRUE) %>%
  gather(Variable, Value, -train_id, -delay_minutes) %>%
  mutate(Variable = factor(Variable, levels = c(
    "lagsstation", "lags2station", "lags3station", "lags4station",
    "lags5station", "lags6station", "lags7station", "lags8station"
  )))

# Compute correlation values
correlations_station <- scatter_data_station %>%
  group_by(Variable) %>%
  summarize(correlation = round(cor(Value, delay_minutes, use = "complete.obs"), 2))

# Create scatterplot
ggplot(scatter_data_station, aes(x = Value, y = delay_minutes)) +
  geom_point(color = "grey20", alpha = 0.6, size = 2) + # Uniform color for points
  geom_smooth(method = "lm", color = "red", se = FALSE, size = 1, linetype = "dashed") + # Best-fit line
  facet_wrap(~Variable, scales = "free", nrow=1) + # Separate scatterplots for each Variable
  labs(
    title = "Scatterplot of Station Lags vs Delay Minutes",
    x = "Lag Value",
    y = "Delay Minutes"
  ) +
  xlim(0,25)+
ylim(0,25)+
  theme_minimal() +
  # Add correlation values as annotations
  geom_text(
    data = correlations_station,
    aes(label = paste("r =", correlation), x = Inf, y = Inf),
    hjust = 1.1, vjust = 1.2, inherit.aes = FALSE
  )

```

# Model

This section split data into a training and test set and develop a 3 week training set and a 2 week test set of all the stations. In the following analysis, Three different linear regression are estimated on `delay.Train`.

Since delays exhibit spatial and temporal correlations, incorporating time and space features into the models can lead to improved predictions. Additionally, variables such as the origin and destination stations (from and to), whether it is a weekday or weekend, temperature, precipitation, and stop sequence are also included to further enhance the model's accuracy.

1.  reg.30 (Predict delays within 30 minutes)

-   Shorter lag variables are used, such as lag45min, lag1h, lag1h15min, etc. These lag variables correspond to the possible effects in the target time range.

-   At the same time, lags3station, lags4station, lags5station, and lags6station contain lagging site variables, indicating the propagation of delay in space.

-   It is suitable for analyzing the prediction of delays in a shorter time range (within 30 minutes).

2.  reg.60 (Forecast delay within 60 minutes)

-   Some lag variables of shorter time (such as lag45min) are omitted, and more attention is paid to lag variables of more than 1 hour, such as lag1h30min, lag2h, lag2h30min, etc.

-   Simplified site lag variables, including only lags5station and lags6station.

-   It is suitable for analyzing predictions of medium time range (within 60 minutes) delays.

3.  reg.90 (Predict delays within 90 minutes)

-   Focus on longer time range lag variables, such as lag1h30min, lag2h, lag3h, and remove short time lag variables (such as lag45min).

-   Only one site lag variable, lags6station, is retained.

-   Suitable for analyzing predictions of long time ranges (within 90 minutes).

```{r}
delay.Train <- filter(ride.panel_station, week <= 11)
delay.Test <- filter(ride.panel_station, week > 11)


reg.30 <- 
  lm(delay_minutes ~ from + to + weekend + Temperature + Precipitation + lag45min 
     + lag1h + lag1h15min + lag1h30min + lag1h45min + lag2h + lag2h15min + lag2h30min + 
       lag2h45min + lag3h 
     + lags3station+ lags4station+ lags5station + stop_sequence  , 
     data=delay.Train)

reg.60 <- 
  lm(delay_minutes ~  from + to + weekend + Temperature + Precipitation 
    + lag1h + lag1h15min + lag1h30min + lag1h45min + lag2h + lag2h15min + lag2h30min + lag2h45min + lag3h + lags5station + lags6station + stop_sequence , 
     data=delay.Train)

reg.90 <- 
  lm(delay_minutes ~  from + to  + weekend + Temperature + Precipitation 
    + lag1h30min + lag1h45min + lag2h + lag2h15min + lag2h30min + lag2h45min + lag3h+ lags6station + stop_sequence , 
     data=delay.Train)
```

```{r, warning=FALSE}
delay.Test.weekNest <- 
   delay.Test %>%
  nest(-week) 

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}
```

The Bar charts below illustrates the MAE values for the Three models. It displays the highest MAE value in the 90 min delay model, and the lowest MAE value in the 30-min delay model. This indicates that the 30-minute model is the most accurate among the three.

```{r, warning=FALSE}
week_predictions <- 
  delay.Test.weekNest %>% 
        mutate(mod.30 = map(.x = data, fit = reg.30, .f = model_pred),
           mod.60 = map(.x = data, fit = reg.60, .f = model_pred),
           mod.90 = map(.x = data, fit = reg.90, .f = model_pred),
           ) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, delay_minutes),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme
```

The time series plot below shows the predicted and observed mean delay. Overall, the models tend to under-predict the observed values. Among all regression models, the 30-minute model is the most accurate, closely capturing the observed delay patterns. The longer time prediction (such as 60 and 90 min model) typically introduce greater uncertainty, so the accuracy decrease while the prediction horizon increases (60 and 90 minutes).

```{r, warning=FALSE}
delay.Test_5 <- delay.Test %>%
  mutate(pre_5 = predict(reg.30, newdata = delay.Test),
         abosulte_error = abs(pre_5 - delay_minutes),
         MAE = mean(abosulte_error, na.rm = TRUE),
         sd_AE = sd(abosulte_error, na.rm = TRUE),
         per_error = (pre_5 - delay_minutes) / delay_minutes,
         per_error = ifelse(per_error == Inf, 0, per_error),
         per_error = ifelse(per_error == -Inf, 0, per_error)) %>%
  rename(mod_30 = pre_5)

delay.Test_6 <- delay.Test%>%
  mutate(pre_6 = predict(reg.60, newdata = delay.Test),
         abosulte_error = abs(pre_6 - delay_minutes),
         MAE = mean(abosulte_error,na.rm = TRUE),
         sd_AE = sd(abosulte_error,na.rm = TRUE),
         per_error = (pre_6 - delay_minutes)/delay_minutes,
         per_error = ifelse(per_error == Inf,0,per_error),
         per_error = ifelse(per_error == -Inf,0,per_error))%>%
  rename(mod_60 = pre_6)

delay.Test_7 <- delay.Test%>%
  mutate(pre_7 = predict(reg.90, newdata = delay.Test),
         abosulte_error = abs(pre_7 - delay_minutes),
         MAE = mean(abosulte_error,na.rm = TRUE),
         sd_AE = sd(abosulte_error,na.rm = TRUE),
         per_error = (pre_7 - delay_minutes)/delay_minutes,
         per_error = ifelse(per_error == Inf,0,per_error),
         per_error = ifelse(per_error == -Inf,0,per_error))%>%
  rename(mod_90 = pre_7)

grid.arrange(
delay.Test_5 %>%
  dplyr::select(interval60, from, delay_minutes, mod_30) %>%
  gather(Variable, Value, -interval60, -from) %>%
  group_by(Variable, interval60) %>%
  summarize(Value = mean(Value, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(interval60, Value, colour = Variable)) + 
    geom_line(size = 0.9) +
    labs(
      title = "Predicted/Observed Delay Time Series", 
      subtitle = "30 Minutes Pre-Predict",  
      x = "Day", 
      y = "Mean Delay"
    ) +
    theme_minimal(),

delay.Test_6 %>%
  dplyr::select(interval60, from, delay_minutes, mod_60) %>%
  gather(Variable, Value, -interval60, -from) %>%
  group_by(Variable, interval60) %>%
  summarize(Value = mean(Value, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(interval60, Value, colour = Variable)) + 
    geom_line(size = 0.9) +
    labs(
      title = "Predicted/Observed Delay Time Series", 
      subtitle = "60 Minutes Pre-Predict",  
      x = "Day", 
      y = "Mean Delay"
    ) +
    theme_minimal(),

delay.Test_7 %>%
  dplyr::select(interval60, from, delay_minutes, mod_90) %>%
  gather(Variable, Value, -interval60, -from) %>%
  group_by(Variable, interval60) %>%
  summarize(Value = mean(Value, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(interval60, Value, colour = Variable)) + 
    geom_line(size = 0.9) +
    labs(
      title = "Predicted/Observed Delay Time Series", 
      subtitle = "90 Minutes Pre-Predict",  
      x = "Day", 
      y = "Mean Delay"
    ) +
    theme_minimal(),
 ncol=1)

```

```{r, warning=FALSE, include=FALSE}
week_predictions %>% 
    mutate(interval15 = map(data, pull, interval15),
           from_id = map(data, pull, from_id), 
           from_lat = map(data, pull, from_lat), 
           from_lon = map(data, pull, from_lon)) %>%
    select(interval15, from_id, from_lon, from_lat, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "mod.30") %>%
  group_by(from_id, from_lon, from_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE)) %>%
ggplot(.)+
  geom_sf(data = NJ_Census, color = "grey50", fill="grey70")+
  geom_point(aes(x = from_lon, y = from_lat, color = MAE), 
             fill = "transparent", alpha = 0.8) +
  scale_color_gradientn(
    name = "Mean Absolute Error",
    colors = c("lightyellow", "red")
  ) +
  guides(
    color = guide_legend() # Use a single legend guide
  ) +
  ylim(min(merged_dataset$from_lat), max(merged_dataset$from_lat))+
  xlim(min(merged_dataset$from_lon), max(merged_dataset$from_lon))+
  labs(title="Mean Abs Error,  Model 30 min")+
  mapTheme


week_predictions %>% 
    mutate(interval15 = map(data, pull, interval15),
           from_id = map(data, pull, from_id), 
           from_lat = map(data, pull, from_lat), 
           from_lon = map(data, pull, from_lon)) %>%
    select(interval15, from_id, from_lon, from_lat, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "mod.60") %>%
  group_by(from_id, from_lon, from_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE)) %>%
ggplot(.)+
  geom_sf(data = NJ_Census, color = "grey50", fill="grey70")+
  geom_point(aes(x = from_lon, y = from_lat, color = MAE), 
             fill = "transparent", alpha = 0.8) +
  scale_color_gradientn(
    name = "Mean Absolute Error",
    colors = c("lightyellow", "red")
  ) +
  guides(
    color = guide_legend() # Use a single legend guide
  ) +
  ylim(min(merged_dataset$from_lat), max(merged_dataset$from_lat))+
  xlim(min(merged_dataset$from_lon), max(merged_dataset$from_lon))+
  labs(title="Mean Abs Error,  Model 60 min")+
  mapTheme

week_predictions %>% 
    mutate(interval15 = map(data, pull, interval15),
           from_id = map(data, pull, from_id), 
           from_lat = map(data, pull, from_lat), 
           from_lon = map(data, pull, from_lon)) %>%
    select(interval15, from_id, from_lon, from_lat, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "mod.90") %>%
  group_by(from_id, from_lon, from_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE)) %>%
ggplot(.)+
  geom_sf(data = NJ_Census, color = "grey50", fill="grey70")+
  geom_point(aes(x = from_lon, y = from_lat, color = MAE), 
             fill = "transparent", alpha = 0.8) +
  scale_color_gradientn(
    name = "Mean Absolute Error",
    colors = c("lightyellow", "red")
  ) +
  guides(
    color = guide_legend() # Use a single legend guide
  ) +
  ylim(min(merged_dataset$from_lat), max(merged_dataset$from_lat))+
  xlim(min(merged_dataset$from_lon), max(merged_dataset$from_lon))+
  labs(title="Mean Abs Error,  Model 90 min")+
  mapTheme

```

The maps below present the MAE (Mean Absolute Error) values for three different models. As observed, the southern stations exhibit relatively higher MAE values compared to other regions, indicating greater prediction errors in these areas. Moreover, the maps below that still illustrates the 30-min model is the most accurate predictions, with smaller and more localized errors.

```{r, warning=FALSE}
# Combine the data for all models into one data frame
all_models_data <- week_predictions %>%
  mutate(interval15 = map(data, pull, interval15),
         from_id = map(data, pull, from_id), 
         from_lat = map(data, pull, from_lat), 
         from_lon = map(data, pull, from_lon)) %>%
  select(interval15, from_id, from_lon, from_lat, Observed, Prediction, Regression) %>%
  unnest() %>%
  filter(Regression %in% c("mod.30", "mod.60", "mod.90")) %>% # Include only the models of interest
  group_by(from_id, from_lon, from_lat, Regression) %>%
  summarize(MAE = mean(abs(Observed - Prediction), na.rm = TRUE), .groups = "drop") %>%
  mutate(Model = case_when(
    Regression == "mod.30" ~ "30 min",
    Regression == "mod.60" ~ "60 min",
    Regression == "mod.90" ~ "90 min"
  ))

# Plot with facet_wrap
ggplot(all_models_data) +
  geom_sf(data = NJ_Census, color = "grey50", fill = "grey70") +
  geom_point(aes(x = from_lon, y = from_lat, color = MAE), 
             fill = "transparent", alpha = 0.8) +
  scale_color_gradientn(name = "Mean Absolute Error", colors = c("lightyellow", "red")) +
  facet_wrap(~Model) +
  ylim(min(merged_dataset$from_lat), max(merged_dataset$from_lat)) +
  xlim(min(merged_dataset$from_lon), max(merged_dataset$from_lon)) +
  labs(title = "Mean Absolute Error by Model") +
  mapTheme

```

```{r, warning=FALSE, message=FALSE, include=FALSE}
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           from_id = map(data, pull, from_id), 
           from_latitude = map(data, pull, from_lat), 
           from_longitude = map(data, pull, from_lon),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, from_id, from_longitude, 
           from_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "mod.30")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
          time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
    geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed delay time", 
       y="Predicted delay time")+
  plotTheme
```

The maps below display the distribution of MAE values for NJ transit stations during different time periods on weekdays and weekends using the 30-minute model. Overall, weekends exhibit relatively higher MAE values compared to weekdays. Meanwhile, the model performs better during weekdays, as indicated by lower MAE values across most stations and time periods.

```{r, warning=FALSE, fig.width=15, fig.height=12, message=FALSE}
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           from_id = map(data, pull, from_id), 
           from_lat = map(data, pull, from_lat), 
           from_lon = map(data, pull, from_lon),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, from_id, from_lon, 
           from_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "mod.30")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(from_id, weekend, time_of_day, from_lon, from_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = NJ_Census, color = "white")+
   geom_point(aes(x = from_lon, y = from_lat, color = MAE), 
             fill = "transparent", alpha = 0.5, size = 3) +
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "E") +
  ylim(min(dat_census$from_lat), max(dat_census$from_lat))+
  xlim(min(dat_census$from_lon), max(dat_census$from_lon))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, 30 min Model")+
  mapTheme
```

# Conclusion

All three of our predictive models can do a decent job at predicting potential delays for NJ transit trains with a MAE of less than 10 minutes. With the help of the mobile app, people who commute to work or school, those who frequently travel using public transit, and those with a tight schedule can benefit from the relative accurate predicted delay time provided by our models. Not only could this help them save time and stay away from unexpected delays, but it can also help them with their traveling behavior while offering them with alternative plans. To make the analysis better, we can always include more relevant information and factors that might affect train delay time for a more accurate and comprehensive understanding of the problem we are dealing with. We can dig deeper into the study and analyze the delay times for each of the stations to find delay patterns that we might overlooked with our macro approach. Another thing we can do is to include data from the adjacent cities’ transit system to understand how they affect the delay times in New Jersey.
